ABSTRACT

A new method of hierarchical clustering of graph vertexes is 
suggested. In the method, the graph partition is determined 
with an equivalence relation satisfying a recursive definition 
stating that vertexes are equivalent if the vertexes they point 
to (or vertexes pointing to them) are equivalent. Iterative 
application of the partitioning yields a hierarchical cluster- 
ing of graph vertexes. The method is applied to the citation 
graph of hep-th. The outcome is a two-level classification 
scheme for the subject field presented in hep-th, and index- 
ing of the papers from hep-th in this scheme. A number of 
tests show that the classification obtained is adequate. 

1. INTRODUCTION 

With the advent of the Internet, scientific literature comes 
closer to realizing the notion of Knowledge Network jOj, with 
the papers as the information units, and the references to 
other papers as the links of the network. An important 
feature in the organization of scientific knowledge is its rep- 
resentation as a hierarchy of developing and transforming 
scientific themes. The search for algorithms able to reveal
this hidden hierarchy by analyzing the network structure
was initiated in the seventies [15] and has
continued until now ^1 ^ 0. Most of the present day 
clustering algorithms involve a number of free parameters 
(e.g., number of clusters, number of hierarchy levels, citation 
threshold, etc.). The values of the free parameters are fixed 
from external considerations. There is normally a strong de- 
pendence of the clustering results on the values of the free 
parameters. As a result, variation of the parameters yields
too broad a set of clusterings, ranging from the trivial
clustering (with a single cluster) to the maximally refined one.

Under Open Task IV of the KDD Cup 2003, we formulate
the question: Do there exist nontrivial hierarchical graph
clusterings that are uniquely determined by the graph
structure or, failing that, depend only weakly (in a certain
sense) on the free parameters of the clustering procedure,
if such parameters are present? In practice, weak dependence
can be defined as weak dependence of characteristics such as
the number of clusters, the number of hierarchy levels, etc.
We answer this question in the affirmative and suggest a new
algorithm, EqRank, that performs a hierarchical clustering
solving the problem. The problem is considered for the case
of directed graphs. The algorithm is applied to the hep-th
citation graph. As a result, we obtain a classification scheme
of the subject field presented in hep-th and an indexing of
the papers in this classification scheme.

The remainder of this paper is organized as follows. In
section 2, we give an informal explanation of the EqRank
algorithm, formalize the intuitive presentation, and demonstrate
that EqRank is closely related to the recursive algorithms
exemplified by the HITS algorithm. We end this section
with a crude estimate of the time complexity of EqRank. In
section 3, we present the results yielded by EqRank applied to
the hep-th citation graph. Section 4 contains a discussion of
the results obtained, and lists some unanswered questions.

2. THE ALGORITHM 

In this section we describe the EqRank algorithm. It can
be applied to any directed graph. Small modifications may
be needed to fine-tune it to a particular setting. The setting
that motivated the development of EqRank is that of a
citation graph.

2.1 Informal Explanation 

We explain the idea behind the EqRank algorithm using
terms natural for a citation graph. Assume that we have
somehow learned how to compute the local hub paper LH(p)
for each paper p. LH(p) cites p and is the most representative
paper among the papers developing the ideas of p. The
mapping LH generates the trajectory (p, LH(p), LH(LH(p)), ...),
a sequence of papers in which every next paper is the local
hub of the previous one. The trajectory starts at the paper p
and ends at the paper RH(p), which is not cited by any paper.
We call the end point of the trajectory the root hub of the
paper p. Let us introduce an equivalence relation on the set
of papers: p ~ p' if RH(p) = RH(p'). The corresponding
equivalence classes are called the modern themes. A modern
theme is formed by the papers that share a common resulting
paper, the root hub, which is the paper underscoring the
present state of the theme. In complete analogy, starting
with the local authority LA(p), which is the paper cited by p
that is most representative among the papers on which p is
based, we determine the partition of the set of papers into
classic themes. Each paper in a classic theme has one and the
same paper as its root authority. Frequently, a root authority
is a seminal paper initiating a new direction of research. The
elements of the partition obtained by intersecting the hub
partition and the authority partition are called simply the
themes. All the papers of a theme have one and the same
root hub and root authority. A modern theme, considered as a
graph whose vertexes are the papers of the theme and whose
links are the links of the citation graph of the form
(p, LH(p)), is an out-tree [5] whose root coincides with the
root hub of the theme. Similarly, a classic theme is an in-tree
whose root coincides with the root authority of the theme.
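
The trajectory construction can be illustrated with a minimal Python sketch. The toy mapping `local_hub` and the paper names are hypothetical. The sketch assumes, as in this informal explanation, that each paper has at most one local hub and that following local hubs never leads into a cycle.

```python
from collections import defaultdict

# Hypothetical toy data: local_hub[p] is the single local hub LH(p) of paper p.
# Papers absent from the mapping are not cited by anyone (trajectory end points).
local_hub = {"a": "c", "b": "c", "c": "e", "d": "f"}
papers = ["a", "b", "c", "d", "e", "f"]

def root_hub(p, local_hub):
    """Follow p, LH(p), LH(LH(p)), ... until a paper without a local hub."""
    while p in local_hub:
        p = local_hub[p]
    return p

def modern_themes(papers, local_hub):
    """Group papers by their root hub: p ~ p' if RH(p) == RH(p')."""
    themes = defaultdict(set)
    for p in papers:
        themes[root_hub(p, local_hub)].add(p)
    return dict(themes)

themes = modern_themes(papers, local_hub)
```

In this toy example the papers a, b, c, and e share the root hub e, while d and f share the root hub f, so two modern themes appear. Classic themes would be obtained in the same way from a mapping of local authorities.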

Restricting consideration to a citation graph, it is natural
to define the value LH(p) as the paper on which a weight
function W(p, P) reaches its maximum:

$$W(p, \mathrm{LH}(p)) = \max_{P} W(p, P).$$

Here W(p, P) is a nonnegative function defining the weight
(relevance) of the link from P to p. We can use as W the
co-citation [15], the bibliographic coupling, or other
link-based measures of proximity [10]. In the actual experiment
we performed over the hep-th citation graph, we used as
W a linear combination of the first two of the measures of
proximity mentioned above (more precisely, the restriction
of this function to the links of the citation graph under
consideration).

2.2 Formal Description of the Algorithm 

In the informal description of the algorithm, it was
implicitly assumed that there is a unique local hub (authority)
for each paper, and that the graph is acyclic. In applications,
both conditions are violated. The following formal
description is valid without these simplifying assumptions.

Let G = G(V, E, W) be a weighted directed graph, where
V is the set of the graph vertexes, E is the binary relation
on V that defines the links, and W is the nonnegative function
on E that defines the weights of the links. Let PS(V) be the
power set, i.e., the set of all subsets of V, and FS(V) be
the final set, i.e., the subset of V consisting of the vertexes
that have no outgoing links. SCR denotes the strong
connectivity relation on the vertexes of the graph.

2.2.1 Auxiliary Operations Acting on the Graphs 

Operation 1. The result of factoring of G(V, E, W),

$$G/R = G(V/R,\, E/R,\, W^{*})$$

is the factor graph of G taken by the equivalence relation R.
Here V/R is the set of the equivalence classes with respect
to R, and E/R is the binary relation on V/R induced by E,
i.e., X(E/R)Y if there exist representatives x and y of these
classes such that xEy. (We write xEy if (x, y) ∈ E.)
The weight W*(X, Y) of a link between two different
equivalence classes X and Y is the sum of the weights of all
the links joining elements of X to elements of Y.
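
A minimal Python sketch of the factoring operation follows. The dictionary-based representation of the graph and the names `factor_graph`, `weights`, and `cls` are illustrative assumptions. The sketch also assumes that links joining two vertexes of the same class are simply dropped from the factor graph.

```python
from collections import defaultdict

def factor_graph(weights, cls):
    """Factor a weighted directed graph by an equivalence relation.

    weights: dict mapping a link (x, y) to its nonnegative weight W(x, y).
    cls:     dict mapping each vertex to a label of its equivalence class.
    Returns a dict mapping class-level links (X, Y), X != Y, to W*(X, Y).
    """
    factored = defaultdict(float)
    for (x, y), w in weights.items():
        X, Y = cls[x], cls[y]
        if X != Y:  # links inside one class do not enter the factor graph
            factored[(X, Y)] += w
    return dict(factored)

# Hypothetical toy example: vertexes 1, 2 form class "A"; 3, 4 form class "B".
weights = {(1, 2): 1.0, (1, 3): 2.0, (2, 4): 0.5, (4, 3): 3.0, (3, 1): 1.5}
cls = {1: "A", 2: "A", 3: "B", 4: "B"}
quotient = factor_graph(weights, cls)
```

Here the two links from class A to class B are merged into one link whose weight is the sum of their weights, and the single link from B to A keeps its weight.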

Operation 2. The result of inversion of G(V, E, W),

$$\mathrm{In}(G) = G(V,\, E^{-1},\, W')$$

is the same graph as G, but with the directions of the links
inverted; W'(p, p') = W(p', p).



Operation 3. The result of retaining in G(V, E, W) only
the maximally weighted links,

$$\mathrm{Max}(G) = G(V,\, E_{\max},\, W')$$

is the graph whose set of links E_max is the subset of the
maximally weighted links, E_max ⊆ E,

$$(x, y) \in E_{\max} \quad \text{if} \quad W(x, y) = \max_{z} W(x, z).$$

W' is the restriction of the function W to E_max.

Operation 4. Graph G can be transformed to a function
Root(G) : V → PS(V),

$$\mathrm{Root}(G)(p) \subseteq FS(V).$$

This subset is singled out in FS(V) by the property that
each of its points is reachable from p along the links.
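
Operations 3 and 4 can be sketched in Python as follows. The function names and the toy graph are hypothetical. The sketch treats a final vertex as reachable from itself, and it assumes an acyclic graph: on a graph with cycles some vertexes may reach no final vertex at all.

```python
def max_links(weights):
    """Operation 3: keep, for every vertex x, only its heaviest outgoing links."""
    best = {}
    for (x, _), w in weights.items():
        best[x] = max(best.get(x, w), w)
    return {(x, y): w for (x, y), w in weights.items() if w == best[x]}

def root(vertices, links):
    """Operation 4: map each vertex p to the set of final vertexes reachable from p."""
    succ = {v: set() for v in vertices}
    for x, y in links:
        succ[x].add(y)
    result = {}
    for p in vertices:
        seen, stack = {p}, [p]
        while stack:  # depth-first traversal from p
            v = stack.pop()
            for u in succ[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        # final vertexes are those without outgoing links
        result[p] = frozenset(v for v in seen if not succ[v])
    return result

vertices = ["a", "b", "c", "d"]
weights = {("a", "b"): 1.0, ("a", "c"): 2.0, ("b", "d"): 1.0, ("c", "d"): 1.0}
kept = max_links(weights)
roots = root(vertices, kept)
```

In the toy graph the lighter link from a to b is discarded by `max_links`, and every vertex is mapped to the single final vertex d.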

2.2.2 Equivalence Relations 

We define the following three equivalence relations on V:
HubR(G), AuthR(G), and EqRank(G).

x ~ y with respect to the partition HubR(G) if

$$\mathrm{Root}(\mathrm{Max}(\mathrm{In}(G))/SCR)(x) = \mathrm{Root}(\mathrm{Max}(\mathrm{In}(G))/SCR)(y);$$

x ~ y with respect to the partition AuthR(G) if

$$\mathrm{Root}(\mathrm{Max}(G)/SCR)(x) = \mathrm{Root}(\mathrm{Max}(G)/SCR)(y).$$

The desired EqRank partition is defined as follows:

$$\mathrm{EqRank}(G) = \mathrm{HubR}(G) \cap \mathrm{AuthR}(G).$$

Some notes are in order. The operation Max(G) (or
Max(In(G))) keeps in the graph only the maximal outgoing
(or ingoing) links, which link the papers to their local
authorities (hubs). We no longer assume that local
authorities and hubs are unique. Because of this, a classic
(modern) theme has as its root not a single paper,
but a set of papers reachable from each paper of the theme
along the maximal links. Let us comment on the presence of
the factoring with respect to SCR in the above definitions.
Without it, going along the links could get stuck in the
cycles of the graph (see subsection 2.3 for extra motivation
of this factoring).

With the above definitions, the EqRank algorithm we suggest
for hierarchical clustering is defined as follows.

• The input of EqRank is a directed weighted graph G.

• The output of the algorithm is the sequence of reduced
graphs G = G_0, G_1, ..., where G_i = G_{i-1} / EqRank(G_{i-1}).

The sequence terminates when G_i ≅ G_{i-1}, i.e., when the
number of vertexes of G_i coincides with the number of
vertexes of G_{i-1}.
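
The definitions of this subsection can be assembled into a compact Python sketch. It is an illustration only, not the code used for the experiments reported here. The function names, the dictionary-based graph representation, and the integer class labels are assumptions. Strong components are found by naive mutual reachability, which is simple but slower than a linear-time algorithm.

```python
from collections import defaultdict

def reach(vertices, links):
    """For each vertex, the set of vertexes reachable along the links (itself included)."""
    succ = defaultdict(set)
    for x, y in links:
        succ[x].add(y)
    out = {}
    for p in vertices:
        seen, stack = {p}, [p]
        while stack:
            for u in succ[stack.pop()]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        out[p] = seen
    return out

def root_classes(vertices, weights):
    """Root(Max(G)/SCR): for each vertex, the set of final strong components it reaches."""
    best = {}
    for (x, _), w in weights.items():
        best[x] = max(best.get(x, w), w)
    links = [(x, y) for (x, y), w in weights.items() if w == best[x]]  # Max(G)
    r = reach(vertices, links)
    # strong component of v: the vertexes mutually reachable with v
    comp = {v: frozenset(u for u in r[v] if v in r[u]) for v in vertices}
    # a component is final if nothing outside it is reachable from it
    return {v: frozenset(comp[u] for u in r[v] if r[u] == comp[u])
            for v in vertices}

def eqrank_step(vertices, weights):
    """One EqRank partition: the intersection of the hub and authority partitions."""
    inverted = {(y, x): w for (x, y), w in weights.items()}  # In(G)
    auth = root_classes(vertices, weights)
    hub = root_classes(vertices, inverted)
    ids = {}
    # vertexes are equivalent iff both their hub roots and authority roots coincide
    return {v: ids.setdefault((hub[v], auth[v]), len(ids)) for v in vertices}

def eqrank(vertices, weights):
    """Return one class map per hierarchy level; stop when no vertexes merge."""
    levels = []
    while True:
        cls = eqrank_step(vertices, weights)
        if len(set(cls.values())) == len(vertices):
            return levels
        levels.append(cls)
        factored = defaultdict(float)  # factor graph G / EqRank(G)
        for (x, y), w in weights.items():
            if cls[x] != cls[y]:
                factored[(cls[x], cls[y])] += w
        vertices, weights = sorted(set(cls.values())), dict(factored)

# Hypothetical toy citation graph: a link (x, y) means that paper x cites paper y.
vertices = [1, 2, 3, 4, 5]
weights = {(1, 2): 1.0, (2, 3): 1.0, (4, 5): 1.0, (4, 3): 0.5}
levels = eqrank(vertices, weights)
```

On this toy graph the first level consists of the two themes {1, 2, 3} and {4, 5}. The weak link between them survives in the factor graph, so the second application merges everything into a single cluster, after which the sequence terminates.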

2.3 A Recursive Definition and Related Works 

In this subsection, we demonstrate that the above equivalence
relation EqRank is a natural development of the recursive
algorithms PageRank 17, HITS 0, and SimRank [7], which
have lately become quite popular among network miners.

In [7] the proximity measure SimRank was introduced
for relational data. Its definition is based on the simple
idea that close (similar) objects should be related to close
(similar) objects. We use the same kind of recursive
definition to define an equivalence relation. In this case, we
say that objects are equivalent if they are linked to equivalent
objects. We will demonstrate that the above EqRank
equivalence relation results from this recursive definition.
(The similarity between SimRank and EqRank has
motivated the name of the latter equivalence relation.)
Below we give the exact definitions. Let G(V, E) be a directed
acyclic graph, V be its set of vertexes, and E be a binary
relation on V. The binary relation E defines the mapping
F_E : V → PS(V):

$$F_E(x) = \{\, y \in V : xEy \,\}.$$

If the above formula yields F_E(x) = ∅, we set by definition
F_E(x) = {x}. We extend the mapping F_E to a mapping defined
on the whole set of subsets of V, PS(V). This is achieved
by the formula

$$F_E(X) = \bigcup_{x \in X} F_E(x).$$

The equivalence relation we want to define should have
the property that x ~ y if F_E(x) ~ F_E(y). This definition
falls short of being recursive, because it involves the relation
between the vertexes and the relation between the subsets
simultaneously. Since V is a subset of PS(V) (each vertex
is identified with its one-element subset), and an equivalence
relation on a set generates an equivalence relation on
any subset, the desired equivalence relation can be
recursively defined as follows: for any X, Y ∈ PS(V), X ~ Y
with respect to the partition EqRank' if F_E(X) ~ F_E(Y).

The above definitions imply that x ~ y if F_E^n(x) ~
F_E^n(y) for any n. Evidently, for any acyclic graph, F_E^n(x) =
Root(G)(x) for n large enough (Root(G) is defined in
subsection 2.2.1). Note that Root(G)(V) = FS(V), and the
mapping F_E acts trivially on FS(V). Thus, there are many
solutions to the above recursive equation for EqRank'. To
single out a particular solution, one has to define an
equivalence relation on FS(V). Let us use the finest equivalence
relation on FS(V):

X ~ Y if X = Y,

where X, Y are subsets of FS(V).
With this choice we have

$$\mathrm{EqRank}'(\mathrm{Max}(G)) = \mathrm{AuthR}(G)$$

and

$$\mathrm{EqRank}'(\mathrm{Max}(\mathrm{In}(G))) = \mathrm{HubR}(G)$$

if G is an acyclic graph. Notice that this EqRank'(G) is
the finest equivalence relation satisfying the above recursive
equation.

EqRank is expressed in terms of EqRank' with a simple
formula:

$$\mathrm{EqRank}(G) = \mathrm{EqRank}'(\mathrm{Max}(G)) \cap \mathrm{EqRank}'(\mathrm{Max}(\mathrm{In}(G))).$$

Lastly, we point out that solving the equation for EqRank'
recursively can be trapped in cycles if they exist. For this
reason, we apply the factoring with respect to the strong
connectivity relation to reduce the problem to the case of
an acyclic graph.
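
The recursive definition can be made concrete with a short Python sketch. It iterates the extended mapping F_E on a one-element set until the set stops changing, and then groups the vertexes by the resulting subset of FS(V), which corresponds to choosing the finest equivalence relation on FS(V). The function names and the toy graph are hypothetical, and the sketch is valid for acyclic graphs only.

```python
def f_e(subset, succ):
    """Extended mapping F_E(X): the union of F_E(x) over x in X.

    F_E(x) is the set of vertexes x points to, or {x} if x has no outgoing links.
    """
    out = set()
    for x in subset:
        out |= succ.get(x) or {x}
    return out

def iterate_to_root(x, succ):
    """Apply F_E repeatedly to {x} until a fixed point is reached (acyclic graphs only)."""
    current = {x}
    while True:
        nxt = f_e(current, succ)
        if nxt == current:
            return frozenset(current)
        current = nxt

# Hypothetical acyclic toy graph: succ[x] is the set of vertexes x points to.
succ = {"a": {"b", "c"}, "b": {"d"}, "c": {"d", "e"}}
classes = {}
for v in ["a", "b", "c", "d", "e"]:
    classes.setdefault(iterate_to_root(v, succ), []).append(v)
```

In the toy graph the vertexes a and c are equivalent because both end at the set {d, e}, while b is equivalent to d. On a graph with cycles the loop above might never terminate, which is the reason for factoring by the strong connectivity relation first.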

2.4 Time Complexity of the Algorithm 

The algorithm consists of the operations defined in
subsection 2.2.1. Thus, we have to estimate the time
complexity of these operations. Evidently, T_O1 ~ E + V,
where E is the number of links and V is the number of
vertexes; T_O2 ~ E; T_O3 ~ E. The most time-consuming is
Operation 4, because it requires computation of the
transitive closure on the graph. Its time complexity
is V(E + V).

We point out that we observed a linear dependence of the
running time on the number of vertexes, up to a scale of
10^4 vertexes. This may be related to a considerable
simplification of the graph after the application of the Max
operation (see subsection 2.2.1). About 76% (91%) of the
vertexes of the graph Max(G) (Max(In(G))) had unit
out-degree in our experiments.

3. HIERARCHICAL CLUSTERING OF HEP-TH CITATION GRAPH

As mentioned above, application of EqRank in a concrete
setting may require fine-tuning. Here we specify the
modification of EqRank that was applied to the hep-th citation
graph, and present the results obtained.

The graph under consideration consists of 27,240 vertexes
and 342,437 links. It contains a number of weakly connected
components. The largest component has 26,870 vertexes.
The remaining 370 papers fragment into 229 small (fewer
than 5 papers) weakly connected components. An examination
of the reference lists of the papers from the small
components reveals the reason behind the presence of these
small components: most of the papers cited from the small
components do not belong to hep-th and, therefore, are absent
from the citation graph under consideration.

The weight of a link was taken to be a linear combination
of the co-citation and bibliographic coupling,

$$W(x, y) = a\,(A^{T}A)_{xy} + (1 - a)\,(AA^{T})_{xy}, \qquad (1)$$

where A is the adjacency matrix of the graph. We set a =
0.9. The closeness of a to unity reflects that we consider
co-citations as a more adequate measure of the importance of a
link (we do not set a = 1 to avoid degeneracy of the weight
function).
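
The weight function (1) can be sketched in plain Python without forming the matrix products explicitly. The sketch assumes the convention that A has a unit entry in row x and column y when paper x cites paper y, so that the co-citation term counts the papers citing both x and y and the coupling term counts the papers cited by both. The function name and the toy citation data are hypothetical.

```python
def link_weights(citations, a=0.9):
    """Weight of each citation link x -> y as a*cocitation + (1 - a)*coupling.

    citations: set of pairs (x, y) meaning that paper x cites paper y.
    The weights are computed only on the links of the citation graph.
    """
    cites, cited_by = {}, {}
    for x, y in citations:
        cites.setdefault(x, set()).add(y)
        cited_by.setdefault(y, set()).add(x)
    weights = {}
    for x, y in citations:
        # number of papers citing both x and y
        cocitation = len(cited_by.get(x, set()) & cited_by.get(y, set()))
        # number of papers cited by both x and y
        coupling = len(cites.get(x, set()) & cites.get(y, set()))
        weights[(x, y)] = a * cocitation + (1 - a) * coupling
    return weights

citations = {("p3", "p1"), ("p3", "p2"), ("p2", "p1"), ("p4", "p1"), ("p4", "p2")}
weights = link_weights(citations)
```

In the toy data the link from p2 to p1 is supported by two co-citing papers, whereas the link from p3 to p2 has no co-citations and receives a nonzero weight only through the coupling term. This illustrates why setting a = 1 would make such links indistinguishable.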

Evidently, the clustering of the weakly connected compo- 
nents can be performed independently for each component. 
The results we present below have been obtained by apply- 
ing EqRank to the largest weakly connected component. 

3.1 Determination of the Number of Themes 

Applying EqRank literally to hep-th yielded too refined a
clustering. The total number of clusters turned out to be
11,299. The reason for the existence of such a large number
of small clusters is as follows. The themes are singled out by 
their root hubs and authorities in our approach. The root 
hubs are characterized by the absence of links pointing to 
them (they are recent papers with no papers citing them). 
The total number of such papers is too high to keep them 
all as representatives of relevant themes. A similar consid- 
eration can be applied to root authorities after inversion of 
the links. 

To obtain a meaningful reduction of the number of themes,
we considered as "actual" themes the themes whose number
of papers exceeded a cutoff value. In the present
experiment the cutoff value was taken to be 20 papers. See
below for an analysis of the dependence of the classification
on the cutoff. For the above value of the cutoff, the number
of actual themes turned out to be 136. The rest of the
themes were glued to the actual themes, each small theme
to the "closest" large theme. The closeness between
the themes A and B was computed as the sum of the weights of
the links between the themes regardless of the direction of
a link. Ultimately, the largest theme turned out to contain
3586 papers, and the smallest, 26 papers.
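
The gluing step can be sketched in Python as follows. The function name and the data layout are illustrative assumptions. In particular, the sketch leaves unchanged any small theme that has no link to a large theme, a case the description above does not cover.

```python
from collections import defaultdict

def glue_small_themes(theme_of, weights, cutoff=20):
    """Merge every theme with at most `cutoff` papers into its closest large theme.

    theme_of: dict paper -> theme label; weights: dict (x, y) -> link weight.
    The closeness of two themes is the total weight of the links between
    them, regardless of direction.
    """
    size = defaultdict(int)
    for t in theme_of.values():
        size[t] += 1
    closeness = defaultdict(float)
    for (x, y), w in weights.items():
        s, t = theme_of[x], theme_of[y]
        if s != t:
            closeness[(s, t)] += w
            closeness[(t, s)] += w
    target = {}
    for t in size:
        if size[t] > cutoff:
            continue  # an "actual" theme stays as it is
        candidates = [(c, big) for (small, big), c in closeness.items()
                      if small == t and size[big] > cutoff]
        if candidates:
            target[t] = max(candidates, key=lambda cb: cb[0])[1]
    return {p: target.get(t, t) for p, t in theme_of.items()}

# Hypothetical toy example with a cutoff of 2 papers.
theme_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B", 7: "C"}
weights = {(7, 1): 0.5, (4, 7): 1.0, (7, 5): 0.7, (1, 4): 3.0}
merged = glue_small_themes(theme_of, weights, cutoff=2)
```

The single-paper theme C is linked to A with total weight 0.5 and to B with total weight 1.7, so it is glued to B.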

3.2 Determination of the Number of Hierarchy Levels

The themes formed after the first-stage clustering of the
hep-th citation graph form the vertexes of the factor graph
(see subsection 2.2.1). This factor graph was clustered with
EqRank. It yielded 19 themes of the second hierarchy level.
At this level, the largest theme contains 15,410 papers, and
the smallest, 65. One more application of EqRank yielded the
trivial clustering: all the papers merged into a single cluster.

We point out that the cutoff was used only to generate 
the first level of the hierarchy. Let us consider the hierarchy 
that would appear without the cutoff. Without the cutoff, a 
star-like graph would appear at the third level of the hierar- 
chy instead of the trivial graph consisting of a single vertex. 
There would be a single super-cluster of 7228 papers, and 
a multitude of small themes each of which would be con- 
nected to the super-cluster either by ingoing or outgoing 
links. Further application of EqRank would shrink the star- 
like graph by absorbing a number of the small themes into 
the super-cluster. Starting from the third hierarchy level, 
the exponential reduction of the number of vertexes in the 
factor graph would switch over to linear reduction. 

Based on the above we claim that EqRank allows comput- 
ing the number of hierarchy levels implied by the structure 
of a graph. For the hep-th citation graph, there are two 
levels in the hierarchy. 

In conclusion, we discuss in more detail the dependence
of the classification obtained for hep-th papers on the cutoff,
which is essentially the only parameter involved in the
procedure. The experiments performed have demonstrated
that the characteristic most stable against changes in the
cutoff is the number of hierarchy levels. It stays constant
once the cutoff exceeds a critical value. (This poses
an interesting mathematical problem of understanding the
number of hierarchy levels as an invariant characteristic of a
directed graph.) Next in stability is the number of clusters
on the first hierarchy level. Reduction of the cutoff simply
adds new small clusters without changing the upper part of
the list of clusters. The most involved is the dependence
of the higher hierarchy levels on the cutoff. When the cutoff
is reduced, not only may new small themes appear, but
clusters in the upper part of the cluster list may also merge
and split.

We summarize the above discussion as follows. We obtained
a set of classifications depending on a single parameter
F_cut, the minimal number of papers in a theme of the first
level. Each classification C is a finite sequence of graphs,

$$C = \{\, G_0,\ G_1(F_{\mathrm{cut}}),\ \ldots,\ G_{n+1}(F_{\mathrm{cut}}) \,\}.$$

The sequence terminates on a trivial graph (more generally,
on a graph invariant with respect to application of EqRank).
The length of the sequence is characterized by the number
n, the number of the levels in the hierarchy. The latter also
depends on F_cut. The dependence is as follows:

$$n(F_{\mathrm{cut}}) = \mathrm{const} \quad \text{if } F_{\mathrm{cut}} \ge F_{\min},$$

and

$$n(F_{\mathrm{cut}}) = f(F_{\mathrm{cut}}) \quad \text{if } F_{\mathrm{cut}} < F_{\min},$$

where f is a function that grows fast as F_cut decreases. In the
case of hep-th, const = 2 and F_min = 8. The qualitative
change of the dependence of the number of hierarchy levels
on F_cut taking place at F_min suggests that it is natural to
set F_cut ≥ F_min.

3.3 Theme Dynamics 

A study of the time dynamics of the themes is a part of our
experiment with hep-th. Specifically, the time dependence of
the size of the themes at different levels of the hierarchy was
considered. A brief account of the results is given in Table 1.
There is an overall increase in the number of papers posted
to hep-th each year. It is distributed unevenly between the
themes. Analyzing this distribution allowed us to classify
the clusters into four groups depending on the character of
the evolution trend. The trend was computed for the period
1992-2002. The data are presented in the plots at
http://hepstructure.inr.ac.ru/hep-th/Theme_dyn.htm.

The first group of the clusters (the "+" trend clusters)
includes "growing" themes. There are 10 themes in this group.
The second group (the "-" trend clusters) includes "fading"
themes. There are 5 themes of this sort. The third group
(the "0" trend clusters) includes 2 themes characterized by a
stable number of papers appearing per year. The fourth
group (the "++" trend clusters) includes 2 themes. They
are "emergent" themes characterized by explosive growth of
the number of papers that appeared in 2002.

Let us make a comment on the emergent themes (the
fourth group). Based on the appearance of the plots of time
dependence, we speculate that if we clustered the hep-th
citation graph using only the data for the period
1992-2001 (we plan to carry out this exercise), the clusters
corresponding to the emergent themes would not appear at all.
In other words, we speculate that the themes of the fourth
group were born precisely in 2002.

3.4 An Estimate of the Clustering Quality 

The quality of the classification obtained for hep-th can
be reliably estimated only with an analysis performed by
experts in this subject field (see, however, the Appendix). In
this section, we give a formal estimate based on the analysis
of the citation graph itself. In [3], in the context of web
clustering, the notion of "ideal community" was introduced.
It is a subset of vertexes with the following property: the
sum of the weights of the inner outgoing links is bigger than
the sum of the weights of the outer outgoing links. Here the
inner links join the vertexes of the "ideal community" subset,
and the outer links are the links starting at the subset and
going out of it. In line with this definition, we computed
the so-called "community index", which is the ratio of the
sum of the weights of the inner links of a cluster to the sum of
the weights of the inner and outer links. If the community index
exceeds 0.5, the community is ideal; the larger the community
index, the more ideal the community. As an overall
characteristic, a weighted mean value of the community index
over all the clusters was computed. Table 2 gives the
values of the community index for the themes of the second
level of the hierarchy. As seen, 18 of the 19 themes comply
with the formal definition of the ideal community. The
weighted mean value of the community index is 0.88 for the
themes of the second level. The situation is less satisfactory
for the themes of the first hierarchy level. About half of
the themes of the first level comply with the definition of


Table 1: Theme Dynamics

Theme Number   Number of Papers   Theme Label                 Trend
1              15,410             ads/cft correspondence      +
2              4,118              non-commutative geometry    +
3              908                tachyon condensation        +
4              858                stokes theorem              +
5              673                iib orientifolds            +
6              578                form factors; ising model   -
7              515                affine toda                 -
8              501                dilaton gravity             -
9              477                higher spin                 +
10             457                pp-wave background          ++
11             434                n = 2 string
12             414                renormalization group       +
13             385                string cosmology
14             376                random matrix               +
15             233                bethe ansatz
16             180                geometric entropy
17             180                rolling tachyon             ++
18             108                gauged supergravity         +
19             65                 taub-nut background         +

Table 2: Community Index

Theme Number   Number of Papers   Theme Label                 Comm. Index
1              15,410             ads/cft correspondence      0.95
2              4,118              non-commutative geometry    0.76
3              908                tachyon condensation        0.78
4              858                stokes theorem              0.86
5              673                iib orientifolds            0.68
6              578                form factors; ising model   0.87
7              515                affine toda                 0.89
8              501                dilaton gravity             0.83
9              477                higher spin                 0.62
10             457                pp-wave background          0.89
11             434                n = 2 string                0.80
12             414                renormalization group       0.96
13             385                string cosmology            0.81
14             376                random matrix               0.79
15             233                bethe ansatz                0.72
16             180                geometric entropy           0.85
17             180                rolling tachyon             0.66
18             108                gauged supergravity         0.37
19             65                 taub-nut background         0.90

the ideal community. Despite this, the weighted mean value 
of the community index exceeds 0.5 at this level also. It is 
0.58. 
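
The community index can be sketched in Python as follows. The function name and the toy weights are hypothetical. The text does not specify how the mean is weighted; the sketch assumes weighting by the number of papers in each theme and applies it to the values listed in Table 2.

```python
def community_index(cluster, weights):
    """Ratio of the inner link weight to the inner plus outer (outgoing) link weight."""
    inner = outer = 0.0
    for (x, y), w in weights.items():
        if x in cluster:
            if y in cluster:
                inner += w  # link stays inside the cluster
            else:
                outer += w  # link starts in the cluster and leaves it
    total = inner + outer
    return inner / total if total else 0.0

# Hypothetical toy graph: cluster {1, 2} has inner weight 3.0 and outer weight 1.0.
toy_weights = {(1, 2): 3.0, (2, 3): 1.0, (3, 1): 5.0}
toy_index = community_index({1, 2}, toy_weights)

# Table 2 data: (number of papers, community index) for the 19 second-level themes.
table2 = [(15410, 0.95), (4118, 0.76), (908, 0.78), (858, 0.86), (673, 0.68),
          (578, 0.87), (515, 0.89), (501, 0.83), (477, 0.62), (457, 0.89),
          (434, 0.80), (414, 0.96), (385, 0.81), (376, 0.79), (233, 0.72),
          (180, 0.85), (180, 0.66), (108, 0.37), (65, 0.90)]
# Assumption: the mean is weighted by the number of papers in each theme.
mean_index = sum(n * c for n, c in table2) / sum(n for n, _ in table2)
ideal_count = sum(c > 0.5 for _, c in table2)
```

Under this assumed weighting, the Table 2 values give a mean of about 0.88, and 18 of the 19 themes exceed the threshold of 0.5, in agreement with the figures quoted for the second level.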

3.5 Theme Representation 

The problem of presenting a theme was not a central one 
for this experiment. Despite this, we gave the themes a num- 
ber of attributes helping to grasp the content of a theme, 
and recognize it. The list of themes with their attributes 
is available at http://hepstructure.inr.ac.ru/hep-th/. 
Specifically, the following attributes were determined for 
each theme: 

• Theme Label. This is a sequence of seven pairs of
words naming the theme. These seven pairs were determined
by a modification of the Frequent and Predictive
Words Method explained in [14]. This method seeks
a word that is optimally unique and common for
a cluster. The modification is that we sought not
separate words but pairs of consecutive words
(prepositions and common words from a stop list
were ignored). The body of the analyzed text was
composed of the titles of the papers of the theme.

• Authority and Hub Papers. For each paper of a cluster,
a pair of numbers (Authority Number, Hub Number)
was computed. The Authority Number is

v' 

where the sum runs over the papers of the theme for
which p is the local authority paper (see subsection
2.1). The Hub Number is the same sum but running
over the papers of the theme for which p is the local
hub (see subsection 2.1). On the above site, we list the
first 10 papers ordered by decreasing Authority
Number and Hub Number.

• List of Main Authors. For each author whose papers
are in a cluster, the pair of numbers (Author Authority
Number, Author Hub Number) was computed. The
Author Authority (Hub) Number is the sum of the Authority
(Hub) Numbers of the papers of the author from the
cluster. On the above site, we list the first 10 authors
ordered by decreasing Author Authority Number
and Author Hub Number.

4. CONCLUSIONS AND OUTLOOK 

To summarize, we suggested a new method of hierarchical
clustering of directed graphs. There is a free parameter
in the method, the minimal acceptable number of vertexes
in a cluster. Qualitative features of the dependence of the
resulting number of hierarchy levels on this parameter allow
setting it within a certain interval of natural values. The
time complexity of the algorithm is at worst quadratic in
the number of vertexes in the graph.

We applied the method to the hep-th citation graph. In 
this application the weight of the links was defined as a linear 
combination of the co-citation and bibliographic coupling. 
The outcome is a two-level hierarchy of themes present in 
hep-th. 

We list below a number of problems for the future.
It would be interesting to study the dependence of the
clustering on the weight function. (We assume here that
the weight function is a function of the graph.) We performed
a preliminary study of the dependence of the clustering on
the parameter a involved in the weight function W (see
section 3). We observed that for an interval of values of a,
the factor graph obtained by the clustering is independent of a.
However, some of the papers travel from cluster to cluster
when a changes.

Another interesting issue is the hierarchical clustering for 
random graphs that are popular among physicists [IT] . In 
this case, the distribution of the papers among the clusters, 
the number of levels in the hierarchy, and other characteris- 
tics of the clustering would be random quantities depending 
on the parameters of the random graph model. For example,
we ask what the difference is between the EqRank
clusterings of graphs generated by the classical (Poisson)
graph models and by more realistic "small world" models.

In conclusion, EqRank opens up interesting possibilities in 
hierarchical clustering of directed graphs. Its use for clus- 
tering of the hep-th citation graph resulted in a meaningful 
classification of the papers from hep-th. 
APPENDIX

A. AN IMITATION OF AN EXPERT ESTIMATE

Seeking an imitation of an expert estimate of our clustering,
we arbitrarily selected a theme of the second level, the
"string cosmology" theme. Google helped to find the home
page http://www.ba.infn.it/~gasperin/ of M. Gasperini,
whose name is among the three top authority authors of the
theme. On his page, M. Gasperini says that the page is
devoted to "string cosmology". There is a list of more than
100 papers on the subject at the site. We selected the papers
from this list that are present in hep-th, and compared the
resulting list with the list of papers of the "string cosmology"
theme generated by EqRank. It turned out that 84%
of the papers selected by M. Gasperini are also selected by
EqRank.

We did the same for a theme of the first level. It was the 
"two-time physics" theme. There is a brief review of this 
theme on the personal page of Itzhak Bars 
(http://physics.usc.edu/~bars/twoTph.htm) who keeps 
the first position in the list of Authority Authors of the 
theme. There is a list of 19 papers in the review. All of 
them were selected by EqRank. 

These facts support our opinion that EqRank generates an 
adequate classification of hep-th. 



